{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Truncated Singular Value Decomposition (TSVD) Multi-Node Multi-GPU (MNMG) Demo\n",
    "\n",
    "The TSVD algorithm is a linear dimensionality reduction algorithm that works really well for datasets in which samples correlated in large groups. Unlike PCA, TSVD does not center the data before computation. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unlike the single-GPU implementation, The MNMG TSVD API currently requires a Dask cuDF Dataframe as input. `transform()` also returns a Dask cuDF Dataframe. The Dask cuDF Dataframe API is very similar to the Dask DataFrame API, but underlying Dataframes are cuDF, rather than Pandas.\n",
    "\n",
    "For information on converting your dataset to Dask cuDF format: https://rapidsai.github.io/projects/cudf/en/0.11.0/dask-cudf.html#multi-gpu-with-dask-cudf\n",
    "\n",
    "For more information about cuML's TSVD implementation: https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#cuml.dask.decomposition.TruncatedSVD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "\n",
    "import pandas as pd\n",
    "import cudf as gd\n",
    "\n",
    "from cuml.dask.common import to_dask_df\n",
    "from cuml.dask.datasets import make_blobs\n",
    "\n",
    "from dask.distributed import Client, wait\n",
    "from dask_cuda import LocalCUDACluster\n",
    "\n",
    "from dask_ml.decomposition import TruncatedSVD as skTSVD\n",
    "from cuml.dask.decomposition import TruncatedSVD as cumlTSVD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Dask Cluster\n",
    "\n",
    "We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster = LocalCUDACluster(threads_per_worker=1)\n",
    "client = Client(cluster)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**15\n",
    "n_features = 128\n",
    "\n",
    "n_components = 4\n",
    "random_state = 42"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X_dcudf, _ = make_blobs(n_samples, \n",
    "                       n_features, \n",
    "                       centers=2, \n",
    "                       cluster_std=1.0,\n",
    "                       random_state=random_state)\n",
    "\n",
    "wait(X_dcudf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_ddf = to_dask_df(X_dcudf).to_dask_array(lengths=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tsvd_sk = skTSVD(n_components=n_components,\n",
    "                 algorithm=\"tsqr\", \n",
    "                 random_state=random_state)\n",
    "\n",
    "result_sk = tsvd_sk.fit_transform(X_ddf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tsvd_cuml = cumlTSVD(n_components=n_components,\n",
    "                     algorithm=\"full\", \n",
    "                     n_iter=5000,\n",
    "                     tol=0.00001,\n",
    "                     random_state=random_state)\n",
    "\n",
    "result_cuml = tsvd_cuml.fit_transform(X_dcudf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Results\n",
    "\n",
    "### Singular Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(tsvd_sk.singular_values_, \n",
    "                     tsvd_cuml.singular_values_.to_array(), \n",
    "                     atol=1e-1)\n",
    "print('compare tsvd: cuml vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk_components = np.abs(tsvd_sk.components_)\n",
    "cuml_components = np.abs(np.asarray(tsvd_cuml.components_.as_gpu_matrix()))\n",
    "\n",
    "passed = np.allclose(sk_components, cuml_components, atol=1e-1)\n",
    "print('compare tsvd: cuml vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compare the reduced matrix\n",
    "passed = np.allclose(result_sk.compute(), np.asarray(result_cuml.compute().as_gpu_matrix()), atol=1)\n",
    "# larger error margin due to different algorithms: arpack vs full\n",
    "print('compare tsvd: cuml vs sklearn transformed results %s'%('equal'if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
